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Abstract. A parallelization of a sweeping preconditioner for 3D Helmholtz equations without 
large cavities is introduced and benchmarked for several challenging velocity models. The setup 
and application costs of the sequential preconditioner are shown to be 0(f 2 N 4 / 3 ) and 0(-yN log N), 
where 7(0;) denotes the modestly frequency-dependent number of grid points per Perfectly Matched 
Layer. Several computational and memory improvements are introduced relative to using black-box 
sparse-direct solvers for the auxiliary problems, and competitive runtimes and iteration counts are 
reported for high-frequency problems distributed over thousands of cores. Two open-source packages 
are released along with this paper: Parallel Sweeping Preconditioner (PSP) and the underlying 
distributed multifrontal solver, Clique. 
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1. Introduction. While definite elliptic partial differential equations can be ef- 
ficiently solved by a wide variety of techniques (e.g., multigrid, ILU, or structured 
matrix factorizations) , indefinite elliptic equations tend to be more challenging. This 
paper is concerned with three-dimensional heterogeneous Helmholtz equations of the 
form, 



Au = 



»(«) = /(«), (i.i) 



c 2 (x) 

where c(x) is the spatially varying wave speed, and u(x)e~ luJt is the time-harmonic 
response to an acoustic wave equation with forcing function f(x)e~ luJt . It is important 
to recognize that —A is positive-definite and that its combination with the negative- 

2 

definite — ^ term results in an indefinite system. 

Before discussing the overall asymptotic complexity of solution techniques, it 
is helpful to first motivate why high frequency problems require large numbers of 
degrees of freedom: Given the wave speed bounds c m ; n < c(x) < c max , we can define 
the minimum wavelength as A m i n = 27rc m i n /u;. In order to resolve oscillations in the 
solution using piecewise polynomial basis functions, e.g., with finite-difference and 
finite-clement methods, it is necessary to increase the number of degrees of freedom in 
each direction at least linearly with the number of wavelengths spanned by the domain. 
In order to combat pollution effects [1], which are closely related to phase errors in 
the discrete solution, one must use asymptotically more than a constant number of 
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grid points per wavelength with standard discretization schemes. Nevertheless, it is 
common practice to resolve the domain to as few as ten points per wavelength. In 
any case, piecewise polynomial discretizations require fl(u! d ) degrees of freedom in d 
dimensions. 



Until recently, doubling the frequency of Eq. (1.1) not only increased the size of 
the linear system by at least a factor of 2 d , it also doubled the number of iterations 
required for convergence with standard preconditioned Krylov methods [HJ 1171 118] . 
Thus, denoting the number of degrees of freedom in a three-dimensional finite-element 
or finite-difference discretization as N — fl(oj 3 ), every linear solve required f2(w 4 ) work 
with traditional iterative techniques. Engquist and Ying recently introduced two 
classes of sweeping preconditioners for Helmholtz equations without large cavities [141 
ITS] : Both approaches approximate a block LDL T factorization of the Helmholtz 
operator in block tridiagonal form in a manner which exploits a radiation boundary 
condition. The first approach performs a block tridiagonal factorization algorithm 
in ^-matrix arithmetic [231 H2] i while the second approach approximates the Schur 
complements of the factorization using auxiliary problems with artificial radiation 
boundary conditions. Though the "H-matrix sweeping preconditioner has theoretical 
support for two-dimensional problems [HI [55], there is not yet justification for three- 
dimensional problems. 

This paper therefore focuses on the second approach, which relies on multifrontal 
factorizations [57] |3H E2J [H] of the approximate auxiliary problems in order to achieve 
an 0(7 2 iV 4 / 3 ) setup cost and an 0(~/N log N) application cost, where 7(0;) denotes the 
number of grid points used for each Perfectly Matched Layer (PML) [29] . While the 
sweeping preconditioner is competitive with existing techniques even for a single right- 
hand side, its main advantage is for problems with large numbers of right-hand sides, 
as the preconditioner appears to converge in O(l) iterations for problems without 
large cavities. Thus, after setting up the preconditioner, typically only 0(jN log N) 
work is required for each solution. 



1.1. Moving PML sweeping preconditioner. The focus of this paper is on 
parallelization of the second class of sweeping preconditioners mentioned above, which 
makes use of auxiliary problems with artificial radiation boundary conditions in order 
to approximate the Schur complements that arise during block LDL T factorization. 
The approach is referred to as a moving PML preconditioner since the introductory 
paper represented the artificial radiation boundary conditions using PML. 

One interpretation of radiation conditions is that they allow for the analysis of a 
finite portion of an infinite domain, as their purpose is to enforce the condition that 
waves propagate outwards and not reflect at the boundary of the truncated domain. 
This concept is crucial to understanding the Schur complement approximations that 
take place within the moving PML sweeping preconditioner which is reintroduced in 
this paper for the sake of completeness. 

For the sake of simplicity, we will describe the preconditioner in the context of 
a second-order finite-difference discretization over the unit cube, with PML used to 
approximate a radiation boundary condition on the — face and homogeneous 
Dirichlet boundary conditions applied on all other boundaries (an 2:1X3 cross-section 
is shown in Fig. |1.1[ ). If the domain is discretized into annxnxn grid, then ordering 
the vertices in the grid such that vertex (11,12,^3) is assigned index i\ + i^n + i^n 2 
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Fig. 1.1. An X1X3 cross section of a cube with PML on its £3 = {ace. The domain is shaded 
in a manner which loosely corresponds to its extension into the complex plane. 



results in a block tridiagonal system of equations, say 
/ A ,o AJ \ 
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(1.2) 



where Aij propagates sources from the i'th X1X2 plane into the j'th X1X2 plane, and 
the overall linear system is complex symmetric (not Hermitian) due to the imaginary 
terms introduced by the PML [T5] , 

If we were to ignore the sparsity within each block, then the following naive 
factorization and solve algorithms would be appropriate for a direct solver, where the 
application of S 1 " 1 implicitly makes use of the factorization of Si. 



Algorithm 1.1: Naive block tridiagonal LDL T factorization. 0(n 7 ) = 
0(N 7 / 3 ) work is required. 

So := Ao,o 
Factor So 

for i = 0, n — 2 do 

Si+i := A, i+M+ i - Ai+xjSi Aj +l i 
Factor Si+i 



While the computational complexities of Algs. |l.l] and |1.2| are significantly higher 
than multifrontal algorithms [571[T2[2T], which have 0(N 2 ) factorization complexity 
and 0(iV 4 / 3 ) solve complexity for regular three-dimensional meshes, they are the 
conceptual starting points of the sweeping preconditionerQ 

Suppose that we paused Alg. |l.l| after computing the i'th Schur complement, Si, 
where the i'th X1X2 plane is in the interior of the domain (i.e., it is not in a PML re- 
gion). Due to the ordering imposed on the degrees of freedom of the discretization, the 



x In fact, they are the starting points of both classes of sweeping preconditioners. The 'H-matrix 
approach essentially executes these algorithms with "H-matrix arithmetic. 



4 



J. POULSON ET AL. 



Algorithm 1.2: Naive block LDL T solve. 0(n 5 ) = 0(N 5 ^ 3 ) work is required. 

// Apply L- 1 

for i = 0, n — 2 do 

|_ u i+ i := u l+1 - 

// Apply D- 1 

for i = 0, n — 1 do 

_ Mi := Sr 1 ^ 

// Apply L~ T 

for i = n — 2, do 



J-l\ 



J-l 



Fig. 1.2. (Left) A depiction of the portion of the domain involved in the computation of the 
Schur complement of an x\X2 plan e (m arked with the dashed line) with respect to all of the planes 
to its left during execution of Alg. (Middle) An equivalent auxiliary problem which generates 

the same Schur complement; the original domain is truncated just to the right of the marked plane 
and a homogeneous Dirichlet boundary condition is placed on the cut. (Right) A local auxiliary 
problem for generating an approximation to the relevant Schur complement; the radiation boundary 
condition of the exact auxiliary problem is moved next to the marked plane. 



first i Schur complements are equivalent to those that would have been produced from 
applying Alg. |1 . l| to an auxiliary problem formed by placing a homogeneous Dirichlet 
boundary condition on the {i + l)'th X1X2 plane and ignoring all of the successive 
planes (see the left half of Fig. 1.2). While this observation does not immediately 
yield any computational savings, it does allow for a qualitative description of the in- 
verse of the i'th Schur complement, S^: it is the restriction of the half-space Green's 
function of the exact auxiliary problem onto the i'th x\X2 plane (recall that PML can 
be interpreted as approximating the effect of a domain extending to infinity). 

The main approximation made in the sweeping preconditioner can now be suc- 
cinctly described: since S^ 1 is a restricted half-space Green's function which incorpo- 
rates the velocity field of the first i planes, it is natural to approximate it with another 
restricted half-space Green's function which only takes into account the local velocity 
field, i.e., by moving the PML next to the i'th plane (see the right half of Fig. 1.2). 

If we use 7 (a;) to denote the number of grid points of PML as a function of the 
frequency, u, then it is important to recognize that the depth of the approximate aux- 
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iliary problems in the x% direction is $1(7)]^] If the depth of the approximate auxiliary 
problems was O(l), then combining nested dissection with the multifrontal method 
over the regular n x nx 0(1) mesh would only require 0(n 3 ) work and 0(n 2 log n) 
storage [21j . However, if the PML size is a slowly growing function of frequency, then 
applying 2D nested dissection to the quasi-2D domain requires 0(j 3 n 3 ) work and 
0(7 2 n 2 logn) storage, as the number of degrees of freedom in each front increases by 
a factor of 7 and dense factorizations have cubic complexity. 

Let us denote the quasi-2D discretization of the local auxiliary problem for the «'th 
plane as H iy and its corresponding approximation to the Schur complement Si as Si. 
Since Si is essentially a dense matrix, it is advantageous to come up with an abstract 
scheme for applying : Assuming that Hi was ordered in a manner consistent with 
the (i±, i2, 13) i-M"i + iin + i^n 2 global ordering, the degrees of freedom corresponding 
to the i'th plane come last and we find that 

H- 1 =(* £*i V (1.3) 



» * 3, 

where the irrelevant portions of the inverse have been marked with a *. Then, trivially. 

:-.)(:m^,>' 

which implies a method for quickly computing Ui given a factorization of Hf. 

Algorithm 1.3: Application of S^ 1 to Ui given a multifrontal factorization of 
Hi. 0(7 2 n 2 logn) work is required. 

Form Ui as the extension of Ui by zero over the artificial PML 
Form i>i := ttj 

Extract ST Ui from the relevant entries of & 



From now on, we write Tj to refer to the application of the ( approximate ) inverse 
of the Schur complement for the i 'th plane. 

There are two more points to discuss before defining the full serial algorithm. 
The first is that, rather than performing an approximate LDL T factorization using 
a discretization of A, the preconditioner is instead built from a discretization of an 
artificially damped version of the Helmholtz operator, say 



J 



(to + iaf 



c 2 (x) 



(1.5) 



where a s» 2ir is responsible for the artificial damping. This is in contrast to shifted 
Laplacian preconditioncrs [5, 16 , where a is typically 0(ui) |18j . and our motivation 
is to avoid introducing large long-range dispersion error by damping the long range 
interactions in the preconditioner. Just as A refers to the discretization of the original 
Helmholtz operator, A, we will use J to refer to the discretization of the artificially 
damped operator, J . 



2 In all of the experiments in this paper, 7(0)) was either 5 or 6, and the subdomain depth never 
exceeded 10. 
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The second point is much easier to motivate: since the artificial PML in each ap- 
proximate auxiliary problem is of depth j{u>), processing a single plane at a time would 
require processing O(n) subdomains with 0(j 3 n 3 ) work each. Clearly, processing 
0(7) planes at once has the same asymptotic complexity as processing a single plane, 
and so combining 0(7) planes reduces the setup cost from 0(-f 3 N 4 ^ 3 ) to 0(7 2 iV 4 / 3 ). 
Similarly, the memory usage becomes 0(jN log N) instead of 0{^/ 2 N log N) j^] If we 
refer to these sets of 0(7) contiguous planes as panels, which we label from to to — 1, 
where to = 0(71/7), an d correspondingly redefine {ui}, {fi}, {Si}, {T,}, and {Hi}, we 
have the following algorithm for setting up an approximate block LDL T factorization 
of the discrete artificially damped Helmholtz operator: 



Algorithm 1.4: Setup phase of the sweeping preconditioner. 0(7 2 iV 4 / 3 ) work 
is required. 

So '■— Jo.Q 

Factor So 

for i = 1 , . . . , to — 1 do 

Form Hi by prefixing PML to 
Factor Hi 



Once the preconditioner is set up, it can be applied using a straightforward mod- 
ification of Alg. |1.2| which avoids redundant solves by combining the application of 
L- 1 and D-h 



Algorithm 1.5: Application of the sweeping preconditioner. 0(-fN log N) 
work is required. 

// Apply L^ 1 and D _1 
for i — 0, to — 2 do 

Ui := TiU, 
_ Ui+i := Ui + i — Ji + i,iU. L 

V"m—1 ■ t^m— 1 

// Apply L- T 

for i = to — 2, do 

[_ m := u, - Ti(Jf +hi u i+ i) 



Given that the preconditioner can be set up with 0(7 2 Af 4 / 3 ) work, and applied 
with 0(7_/VlogiV) work, it provides a near-linear scheme for solving Helmholtz equa- 
tions if only O(l) iterations are required for convergence. The experiments in this 
paper seem to indicate that, as long as the velocity model does not include a large 
cavity, the sweeping preconditioner indeed only requires O(l) iterations. 

Though this paper is focused on the parallel solution of Helmholtz equations, 
which are the time-harmonic form of acoustic wave equations, Tsuji et al. have shown 
that the moving PML sweeping preconditioner is equally effective for time-harmonic 
Maxwell's equations [38, 155] , and we believe that the same will hold true for time- 
harmonic linear elasticity. The rest of the paper will be presented in the context of 



3 Note that increasing the number of planes per panel provides a mechanism for interpolating 
between the sweeping preconditioner and a full multifrontal factorization. 
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the Helmholtz equation, but we emphasize that the parallelization should carry over 
to more general wave equations in a conceptually trivial way. 

1.2. Related work. A domain decomposition variant of the sweeping precon- 
ditioner was recently introduced [37j which results in fast convergence rates, albeit at 
the expense of requiring PML padding on both sides of each subdomain. Recalling our 
previous analysis with respect to the PML size, 7, the memory usage of the precon- 
ditioner scales linearly with the PML size, while the setup cost scales quadratically. 
Thus, the domain decomposition approach should be expected to use twice as much 
memory and require four times as much work for the setup phase. On the other hand, 
doubling the subdomain sizes allows for more parallelism in both the setup and solve 
phases, and less sweeps seem to be required. 

Another closely related method is the Analytic ILU factorization |19j . Like the 
sweeping preconditioner, it uses local approximations of the Schur complements of the 
block LDL T factorization of the Helmholtz matrix represented in block tridiagonal 
form. There are two crucial differences between the two methods: 

• Roughly speaking, AILU can be viewed as using Absorbing Boundary Con- 
ditions (ABC's) [T3] instead of PML when forming approximate subdomain 
auxiliary problems. While ABC's result in strictly 2D local subproblems, 
versus the quasi-2D subdomain problems which result from using PML, they 
are well-known to be less effective approximations of the Sommerfeld radia- 
tion condition (and thus the local Schur complement approximations are less 
effective). The sweeping preconditioner handles its non-trivial subdomain 
factorizations via a multifrontal algorithm. 

• Rather than preconditioning with an approximate LDL T factorization of the 
original Helmholtz operator, the sweeping preconditioner sets up an approx- 
imate factorization of a slightly damped Helmholtz operator in order to mit- 
igate the dispersion error which would result from the Schur complement 
approximations. 

These two improvements are responsible for transitioning from the 0(u>) iterations 
required with ARU to the O(l) iterations needed with the sweeping preconditioner 
(for problems without large cavities). 

Two other iterative methods warrant mentioning: the two-grid shifted-Laplacian 
approach of [5] and the multilevel- ILU approach of [B]. Though both require 0(uj) 
iterations for convergence, they have very modest memory requirements. In par- 
ticular, [5] demonstrates that, with a properly tuned two-grid approach, large-scale 
heterogeneous 3D problems can be solved with impressive timings. 

There has also been a recent effort to extend the fast-direct methods presented 
in [43] from definite elliptic problems into the realm of low-to-moderate frequency 
time- harmonic wave equations [101 EI] • In particular, their experiments (see Table 3 
of [ID]) suggest an asymptotic complexity of roughly 0(N 1 - 8 ), which is a noticeable 
improvement over the 0(N 2 ) complexity of traditional 3D sparse-direct methods. 

2. Parallel sweeping preconditioner. The setup and application stages of 



the sweeping preconditioner (Algs. 1.4 and 1.5) essentially consist of to = 0(n/j) 
multifrontal factorizations and solves, respectively. The most important detail is that 
the subdomain factorizations can be performed in parallel, while the subdomain solves 
must happen sequentially^ When wc also consider that each subdomain factorization 



4 While it is tempting to try to expose more parallelism with techniques like cyclic reduction 
(which is a special case of a multifrontal algorithm), their straightforward application destroys the 
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Fig. 2.1. A separator-based supernodal elimination tree (right) over a quasi-2D subdomain (left). 




Fig. 2.2. Overlay of the process ranks (in binary) of the owning subteams of each supernode 
from the elimination tree in Fig. \2. 1\ when the tree is assigned to eight processes using a subtree-to- 
subteam mapping; a '*' is used to denote both and 1, so that '00*' represents processes and 1, 
'01* ' represents processes 2 and 3, and '* * * ' represents all eight processes. 



requires 0(7 3 n 3 ) work, while subdomain solves only require 0(7 2 n 2 log n) work, we 
see that, relative to the subdomain factorizations, subdomain solves must extract a 
factor of m = 0(71/7) more parallelism from a factor of 0(7/1/ log n) less operations. 
We thus have a strong hint that, unless the subdomain solves are carefully handled, 
they will be the limiting factor in the scalability of the sweeping preconditioner. 

2.1. Parallel multifrontal algorithms. While a large number of techniques 
exist for parallelizing multifrontal factorizations and triangular solves, we focus on 
parallelizations which combine subtree-to-subteam j20j mappings of processes to the 
elimination tree [34j that also make use of two-dimensional distributions of the frontal 
matrices 35 More specifically, we make use of supernodal [13! elimination trees 
defined through nested dissection (see Figs. 2.1 and 2.2), which have been shown 
to result in highly scalable factorizations [24, 23J and moderately scalable triangular 
solutions [26] . 

Roughly speaking, the analysis in shows that, if pf processes are used in the 
multifrontal factorization of our quasi-2D subdomain problems, then we must have 
777 = fi(p^/ 2 ) in order to maintain constant efficiency as pp is increased; similarly, if pg 
processes are used in the multifrontal triangular solves for a subdomain, then we must 
have 771 Q(ps) (where we use ~ to denote that the equality holds within logarithmic 
factors). Since we can simultaneously factor the m = 0(71/7) subdomain matrices, we 



Schur complement properties that we exploit for our fast algorithm. 

5 Cf. pQ, which advocates for only distributing the root frontal matrix two-dimensionally and 
using a one-dimensional distribution for all other fronts. 
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denote the total number of processes as p and set ps — P and pp = 0(p/m); then the 
subdomain factorizations only require that n 3 — il(p/j), while the subdomain solves 
have the much stronger constraint that n sa Vt{p/^). This last constraint should be 
considered unacceptable, as we have the conflicting requirement that n 3 w 0(p/j) in 
order to store the factorizations in memory. It is therefore advantageous to consider 
more scalable alternatives to standard multifrontal triangular solves, even if they 
require additional computation. 

2.2. Selective inversion. The lackluster scalability of dense triangular solves 
is well known and a scheme known as selective inversion was introduced in [32| and 
implemented in [31] specifically to avoid the issue; the approach is characterized by 
directly inverting every distributed dense triangular matrix which would have been 
solved against in a normal multifrontal triangular solve. Thus, after performing selec- 
tive inversion, every parallel dense triangular solve can be translated into a parallel 
dense triangular matrix-vector multiply. 

Suppose that we have paused a multifrontal LDL T factorization just before pro- 
cessing a particular front, F, which corresponds to some supernode, S. Then all of the 
fronts for the descendants of iS have already been handled, and F can be partitioned 
as 



where Ftl holds the Schur complement of supernode S with respect to all of its 
descendants, Fbl represents the coupling of S and its descendants to <S's ancestors, 
and Fbr holds the Schur complement updates from the descendants of S for the 
ancestors of S. Using hats to denote input states, e.g., Ftl to denote the input state 
of Ftl i the first step in processing the frontal matrix F is to overwrite Ftl with its 
LDL T factorization, which is to say that Ftl is overwritten with the strictly lower 
portion of a unit lower triangular matrix Lp and a diagonal matrix Dp such that 
Ftl — LpDpL 1 ^. 

The partial factorization of F can then be completed via the following steps: 

1. Solve against Lp" to form Fbl '■= FblL f t . 

2. Form the temporary copy Zbl '■= Fbl- 

3. Finalize the coupling matrix as Fbl '■= FblDJ, 1 . 

4. Finalize the update matrix as Fbr := Fbr — FblFtl^bl = Fbr — ZblFq L . 
After adding Fbr onto the parent frontal matrix, only Ftl and Fbl are needed in 
order to perform a multifrontal solve. For instance, applying L^ 1 to some vector x 
proceeds up the elimination tree (starting from the leaves) in a manner similar to the 
factorization; after handling all of the work for the descendants of some supernode S, 
only a few dense linear algebra operations with <S's corresponding frontal matrix, say 
F, axe required. Denoting the portion of x corresponding to the degrees of freedom 
in supernode S by x$, we must perform: 

1. x s := L F x x s 

2. x v = -Fblxs 

3. Add xjj onto the entries of x corresponding to the parent supernode. 

The key insight of selective inversion is that, if we invert each distributed dense unit 
lower triangular matrix Lp in place, all of the parallel dense triangular solves in a 
multifrontal triangular solve are replaced by parallel dense matrix-vector multiplies. It 
is also observed in [32] that the work required for the selective inversion is typically 
only a modest percentage of the work required for the multifrontal factorization, and 




(2.1) 
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that the overhead of the selective inversion is easily recouped if there are several 
right-hand sides to solve against. 

Since each application of the sweeping preconditioner requires two multifrontal 
solves for each of the m — 0(n/j) sub-domains, which are relatively small and likely 
distributed over a large number of processes, selective inversion will be shown to 
yield a very large performance improvement. We also note that, while it is widely 
believed that direct inversion is numerically unstable, in |llj Druinsky and Toledo 
provide a review of (apparently obscure) results dating back to Wilkinson (in (I2"] 1 
which show that x := inv(A)*b is as accurate as a backwards stable solve if reasonable 
assumptions are met on the accuracy of inv(A). Since inv(A)*6 is argued to be more 
accurate when the columns of mv(A) have been computed with a backwards-stable 
solver, and both mv(Fxz) and inv(F^ L ) must be applied after selective inversion, it 
might be worthwhile to modify selective inversion to compute and store two different 
inverses of each Ftl- one by columns and one by rows. 

2.3. Global vector distributions. The goal of this subsection is to determine 
an appropriate scheme for distributing global vectors, i.e., ones representing a function 
over the entire domain (as opposed to only over a panel) . And while the factorizations 
themselves may have occurred on subteams of 0(p/m) processes each, in order to make 
use of all available processes for every subdomain solve, at this point we assume that 
each auxiliary problem's frontal tree has been selectively inverted and is distributed 
using a subtree-to-subteam mapping (recall Fig. 2.2 1 over the entire set of p processes]^] 



Since a subtree-to-subteam mapping will assign each supernode of an auxiliary 
problem to a team of processes, and each panel of the original 3D domain is by 
construction a subset of the domain of an auxiliary problem, it is straightforward to 
extend the supernodal subteam assignments to the full domain. We should then be 
able to distribute global vectors so that no communication is required for readying 
panel subvectors for subdomain solves (via extension by zero for interior panels, and 
trivially for the first panel). Since our nested dissection process does not partition 



in the shallow dimension of quasi-2D subdomains (see Fig. 2.1), extending a vector 
defined over a panel of the original domain onto the PML-padded auxiliary domain 
simply requires individually extending each supernodal subvector by zero in the £3 
direction. 

Consider an element-wise two-dimensional cyclic distribution |30j of a frontal 
matrix F over q processes using an r x c process grid, where r and c are 0(^/q). Then 
the entry will be stored by the process in the (i mod r,j mod c) position in the 
process grid. Using the notation from [30], this distributed front would be denoted 
as F[Mc,M R ], while its top-left quadrant would be referred to as F TL [M C ,M R ] (see 



Fig. 2.3 for a depiction of an [Mc,Mr] matrix distribution). 

Loosely speaking, F[X, Y] means that each column of F is distributed using the 
scheme denoted by X, and each row is distributed using the scheme denoted by Y. 
For the element-wise two-dimensional distribution used for F, [Mc, Mr], we have that 
the columns of F are distributed like Matrix Columns (Mc), and the rows of F are 
distributed like Matrix Rows (Mr). While this notation may seem vapid when only 
working with a single distributed matrix, it is useful when working with products of 
distributed matrices. For instance, if a is used to represent rows/columns being 
redundantly stored (i.e., not distributed), then the result of every process multiplying 



6 In cases where the available solve parallelism has been exhausted but the problem cannot be 
solved on less processes due to memory constraints, it would be preferable to solve with subdomains 
stored on subsets of processes. 
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Fig. 2.3. Overlay of the owning process ranks of an 7 X 7 matrix distributed over a 2 X 3 
process grid in the [Mc> Mr] distribution, where Mq assigns row i to process row i mod 2, and Mr 
assigns column j to process column i mod 3 (left). The process grid is shown on the right. 
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Fig. 2.4. Overlay of the owning process ranks of a vector of height 7 distributed over a 2 X 3 
process grid in the \Vc,*] vector distribution (left) and the [Vr,*] vector distribution (right). 



its local submatrix of A[X, *] with its local submatrix of £?[*, Y] forms a distributed 
matrix C[X, Y] = (AB)[X,Y] — A[X, *] £?[*, Y], where the last expression refers to 
the local multiplication process. 

We can now decide on a distribution for each supernodal subvector, say Xs, 
based on the criteria that it should be fast to form Ftlxs & n d F^ L xs in the same 
distribution as x$, given that F?l is distributed as Ftl[Mc, Mr}. Suppose that we 
define a Column-major Vector distribution (Vc) of x$, say Xs[Vc)*], to mean that 
entry i is owned by process i mod q, which corresponds to position (i mod r, \ i/r\ mod 
c) in the process grid (if the grid is constructed with a column-major ordering of the 



process ranks; see the left side of Fig. 2.4). Then a call to MPI_Allgather [10] within 
each row of the process grid would allow for each process to collect all of the data 
necessary to form xs[Mc,*\, as for any process row index s€{0,l,...,r — 1}, 



c-l 



{i e N : i mod r = s} = (J {i e N : i mod q = s + tr}. (2.2) 



See the left side of Fig. 2.5 for an example of an [Mq, *] distribution of a 7 x 3 matrix. 



Similarly, if x$ was distributed with a Row-major Vector distribution (Vr), as 



shown on the right side of Fig. 2.4 say xs[Vr,*], so that entry i is owned by the process 
in position (LV C J m od r,i mod c) of the process grid, then a call to MPI_Allgather 
within each column of the process grid would provide each process with the data nec- 
essary to form xs [Mr, *] . Under reasonable assumptions, both of these redistributions 
can be shown to have per-process communication volume lower bounds of f2(n/y / p) 
(if Ftl is n x n) and latency lower bounds of Q,{\og 2 {^/p)) [9]. We also note that 
translating between xsfVcs*] and x^Vr,*] simply requires permuting which process 
owns each local subvector, so the communication volume would be 0(n/p), while the 
latency cost is 0(1). 
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Fig. 2.5. Overlay of the owning process ranks of a vector of height 7 distributed over a 2 X 3 
process grid in the [Mq,*] distribution (left) and the [Mr,*] distribution (right). 



We have thus described efficient techniques for redistributing ^[Vc*] to both 
the x$ [Mr,*] and xs[Mc,*\ distributions, which are the first steps for our parallel 
algorithms for forming Ftl%s and Frf L xs, respectively: Denoting the distributed 
result of each process in process column t <E {0, 1, c — 1} multiplying its local sub- 
matrix of Ftl[Mc, Mr] by its local subvector of xs[Mr,*\ as z'''[Mc,*], it holds 



that (Ftlxs)[Mc, *] — J2t=o z ^ [Mc> *]• Since Eq. (2.2) also implies that each pro- 
cess's local data from a [Vc,*] distribution is a subset of its local data from a [Me 1 ,*] 
distribution, a simultaneous summation and scattering of {z^ [Mq, *]}t=o within pro- 
cess rows, perhaps via MPI_Reduce_scatter or MPI_Reduce_scatter_block, yields 
the desired result, (Ftl%s)[Vc,*\- An analogous process with (Ftl[Mc, Mr]) t — 
Fr£ L [Mn, Mc] and xs[Mc,*\ yields (Fj^ l xs)[Vr,*], which can then be cheaply per- 
muted to form [F^ L xs)\Yc, *]. Both calls to MPI_Reduce_scatter_block can be 
shown to have the same communication lower bounds as the previously discussed 
MPI_Allgather calls [5]. 

As discussed at the beginning of this section, defining the distribution of each 
supernodal subvector specifies a distribution for a global vector, say [£,*]. While 



the [Vc,*] distribution shown in the left half of Fig. 2.4 simply assigns entry i of 
a supernodal subvector xs, distributed over q processes, to process i mod q, we can 
instead choose an alignment parameter, tr, where < a < q, and assign entry i to 
process (i + a) mod q. If we simply set a = for every supernode, as the discussion 
at the beginning of this subsection implied, then at most 0(jn) processes will store 
data for the root separator supernodes of a global vector, as each root separator only 
has 0(771) degrees of freedom by construction. However, there are m = 0(71/7) root 
separators, so we can easily allow for up to 0(n 2 ) processes to share the storage of 
a global vector if the alignments arc carefully chosen. It is important to notice that 
the top-left quadrants of the frontal matrices for the root separators can each be 
distributed over 0(7 2 n 2 ) processes, so 0(7 2 n 2 ) processes can actively participate in 
the corresponding triangular matrix-vector multiplications. 

2.4. Parallel preconditioned GMRES(k). Since, by hypothesis, only O(l) 
iterations of GMRES(fc) will be required for convergence with the sweeping precon- 
ditioner, a cursory inspection of Algorithm |1.5| reveal that most of the work in a 
preconditioned iterative method, such as GMRES(fc), will be performed in the mul- 
tifrontal solves during the preconditioner application, but a modest portion will also 
be spent in sparse matrix-vector multiplication with the discrete Helmholtz operator, 
A, and the off-diagonal blocks of the discrete artificially damped Helmholtz operator, 
J. It is thus important to parallelize the sparse matrix-vector multiplies, but it is 
not crucial that the scheme be optimal, and so we simply distribute A and J in the 
same manner as vectors, i.e., with the [G,*\ distribution derived from the auxiliary 
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problems' frontal distributions. 

For performance reasons, it is beneficial to solve as many right-hand sides simul- 
taneously as possible: both the communication latency and the costs of loading the 
local data from frontal and sparse matrices from main memory can be amortized over 
all of the right-hand sides. Another idea is to extend the so-called trsm algorithm 
for triangular solves with many right-hand sides (i.e., more right-hand sides than pro- 
cesses) , which is well-known in the field of dense linear algebra [3U] , into the realm of 
sparse-direct solvers via the dense frontal triangular solves. This approach was not 
pursued in this paper due to the modest storage space available on Lonestar and is 
left for future work. Another performance improvement might come from exploiting 
block variants of GMRES [3j5], which can potentially lower the number of required 
iterations. 

2.5. Clique. In order to implement the previously discussed techniques for scal- 
able multifrontal factorizations and solves (via selective inversion), an open-source 
distributed multifrontal solver named Clique was built on top of Elemental |30| . a 
library for distributed-memory dense linear algebra. In addition to being designed to 
support the techniques we discussed above: selective inversion, subtree-to-submesh 
mappings, and two-dimensional frontal matrix distributions, it was also written with 
a strong emphasis on memory scalability. This is because the sweeping preconditioner 
requires large numbers of factorizations of relatively small sparse matrices, and so it is 
crucial that the per-process memory usage for each subdomain factorization decreases 
inversely with the total number of processes. 

We note that Clique was designed specifically to provide a memory-scalable mul- 
tifrontal implementation for our parallel sweeping preconditioner, and so there is not 
yet support for pivoting. We plan to add a pivoted LU factorization in the near future. 

2.6. Parallel Sweeping Preconditioner (PSP). Given the discussion in Sec- 
tion [2j it is most convenient to describe our prototype implementation of a parallel 
sweeping preconditioner based upon its deviations from our proposed approach. The 
primary difference is that there is not yet support for simultaneously factoring the 
subdomain auxiliary problems and then redistributing each frontal tree to the entire 
set of processes. This will certainly lead to large improvements in the scalability of 
the setup phase, but it is left for future work. 

3. Experimental results. Our experiments were performed on the Texas Ad- 
vanced Computing Center (TACC) machine, Lonestar, which is comprised of 1,888 
compute nodes, each equipped with two hex-core 3.33 GHz processors and 24 GB of 
memory, which are connected with QDR InfiniBand using a fat-tree topology. Our 
tests launched eight MPI processes per node in order to provide each MPI process 
with 3 GB of memory. 

Our experiments took place over five different 3D velocity models: 

• A uniform background with a high-contrast barrier. The domain is the unit 
cube and the wave speed is 1 except in [0,1] x [0.25,0.3] x [0,0.75], where it 
is 10 10 . 

• A wedge problem over the unit cube, where the wave speed is set to 2 if 
Z < 0.4 + 0.1x2, 1.5 if otherwise Z < 0.8 — 0.2a?2, and 3 in all other cases. 

• A two-layer model defined over the unit cube, where c = 4 if xi < 0.5, and 
c = 1 otherwise. 

• A waveguide over the unit cube: c(x) = 1.25(1 - o^e- 32 ^ 1 - - 5 ! 2 ^ 2 - - 5 ! 2 '). 



The SEG/EAGE Overthrust model 0, see Fig. 3.2 
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velocity model 


barrier 


wedge 


two-layers 


waveguide 


Hz 


50 


75 


50 


37.5 


PML amplitude 


3.0 


4.0 


4.0 


2.0 


iterations 


28 


49 


48 


52 



Table 3.1 



The number of iterations required for convergence for four model problems (with four forcing 
functions per model). The grid sizes were 500 3 and roughly 50 wavelengths were spanned in each 
direction. Five grid points were used for all PML discretizations, four planes were processed per 
panel, and the damping factors were all set to 7. 



In all of the following experiments, the shortest wavelength of each model is 
resolved with roughly ten grid points and the performance of the preconditioner is 
tested using the following four forcing functions: 

• a single shot centered at x , /o( x ) = ne~ 10 "ll x ~ x °ll , 



three shots, fi (x) = J2i=o ne 



-10n||x- 



• a Gaussian beam centered at x 2 in direction d, /2( x ) = e I " x d e _4w ll x_X2 H , 
and 

• a plane wave in direction d, /s(x) = e^ x ' d , 

where x = (0.5,0.5,0.1), xi = (0.25,0.25,0.1), x 2 = (0.75,0.75,0.5), and d = 
(1,1, — l)/v3. Note that, in the case of the Overthrust model, these source loca- 
tions should be interpreted proportionally (e.g., an X3 value of 0.1 means a depth 
which is 10% of the total depth of the model) . Due to the oscillatory nature of the 
solution, we simply choose the zero vector as our initial guess in all experiments. 

The first experiment was meant to test the convergence rate of the sweeping 
preconditioner over domains spanning 50 wavelengths in each direction (resolved to 
ten points per wavelength) and each test made use of 256 nodes of Lonestar. During 
the course of the tests, it was noticed that a significant amount of care must be taken 
when setting the amplitude of the derivative of the PML takeoff function (i.e., the 
"C" variable in Eq. (2.1) from [15]). For the sake of brevity, hereafter we refer to 
this variable as the PML amplitude. A modest search was performed in order to 
find a near-optimal value for the PML amplitude for each problem. These values are 
reported in Table 3.1 along with the number of iterations required for the relative 
residuals for all four forcing functions to reduce to less than 10~ 5 . 

It was also observed that, at least for the waveguide problem, the convergence 
rate would be significantly improved if 6 grid points of PML were used instead of 
5. In particular, the 52 iterations reported in Table |3.1| reduce to 27 if the PML 
size is increased by one. Interestingly, the same number of iterations are required for 
convergence of the waveguide problem at half the frequency (and half the resolution) 
with five grid points of PML. Thus, it appears that the optimal PML size is a slowly 
growing function of the frequency^] We also note that, though it was not intentional, 
each of the these first four velocity models is invariant in one or more direction, and 
so it would be straightforward to sweep in a direction such that only O(l) panel 
factorizations would need to be performed, effectively reducing the complexity of the 
setup phase to 0(j 2 N). 

The last experiment was meant to simultaneously test the convergence rates and 
scalability of the new sweeping preconditioner using the Overthrust velocity model 



7 A similar observation is also made in 1371 . 
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Fig. 3.1. A single X2X3 plane from each of the four analytical velocity models over a 500 3 
grid with the smallest wavelength resolved with ten grid points. (Top-left) the three-shot solution for 
the barrier model near x\ = 0.7, (bottom-left) the three-shot solution for the two-layer model near 
xi = 0.7, (top-right) the single-shot solution for the wedge model near xi = 0.7, and (bottom-right) 
the single-shot solution for the waveguide model near x\ = 0.55. 



(see Fig. 3.2) at various frequencies, and the results are reported in Table 3.2 It 



is important to notice that this is not a typical weak scaling test, as the number of 
grid points per process was fixed, not the local memory usage or computational load, 
which both grow superlinearly with respect to the total number of degrees of freedom. 
Nevertheless, including the setup phase, it took less than three minutes to solve the 
3.175 Hz problem with four right-hand sides with 128 cores, and just under seven and 
a half minutes to solve the corresponding 8 Hz problem using 2048 cores. Also, while 
it is by no means the main message of this paper, the timings without making use 
of selective inversion are also reported in parentheses, as the technique is not widely 
implemented^] 

4. Conclusions. A parallelization of the moving PML sweeping preconditioner 
has been presented which has allowed us to efficiently solve 3D Helmholtz equations in 
parallel with essentially 0(1) iterations, with the only observed frequency-dependence 
arising from a moderate growth in the PML size with increasing frequency. This size of 
the PML, ) was explained to result in a linear growth in the memory requirements 
of the preconditioner and a quadratic growth in the setup cost. Results were then 
presented for a variety of models, one of which had a velocity field which varied by 



Other than Clique, the only other implementation appears to be in DSCPACK |31| . 
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Fig. 3.2. Three cross-sections of the SEG/EAGE Overthrust velocity model, which represents 
an artificial 20 km X 20 km X 4.65 km domain, containing an overthrust fault, using samples ev- 
ery 25 m. The result is an 801 X 801 X 187 grid of wave speeds varying dis continuously between 
2.179 km/sec (blue) and 6.000 km/sec (red). 





number of processes 


128 


256 


512 


1024 


2048 


Hz 


3.175 


4 


5.04 


6.35 


8 


grid 


319^x75 


401^x94 


505^x118 


635^x145 


801^x187 


setup (sec) 


48.40 (46.23) 


66.33 (63.41) 


92.95 (86.90) 


130.4 (120.6) 


193.0 (175.2) 


apply (scc/rhs) 


0.468 (1.07) 


0.550 (1.28) 


0.645 (2.40) 


0.700 (3.33) 


0.880 (6.13) 


3 digits (iter's) 


42 


44 


42 


39 


40 


4 digits (iter's) 


54 


57 


57 


58 


58 


5 digits (iter's) 


63 


69 


70 


68 


72 



Table 3.2 



Convergence rates and timings on TAGC's Lonestar for the SEG/EAGE Overthrust model, 
where timings in parentheses do not make use of selective inversion. All cases used a double- 
precision second-order stencil with five grid spacings for all PML (with an amplitude of 7.5), and 
a damping parameter of 2.25iz. The preconditioner was configured with four planes per panel and 
eight processes per node, and the 'apply' timings are with respect to a single application of the 
preconditioner to four right-hand sides. 



ten orders of magnitude, and convergence was shown to be essentially independent of 
frequency for the challenging Overthrust model. 

Also, despite the requirement that each panel must be solved against one at a 
time when applying the preconditioner, a custom approach was introduced and imple- 
mented which eliminates most of the communication associated with performing tra- 
ditional black-box sparse-direct factorizations and solves over each subdomain. These 
implementations are now released as part of the open-source packages Clique and 
Parallel Sweeping Preconditioner (PSP). There are at least five important directions 
for future work: 

• developing a heuristic for tailoring the PML profile to the velocity field, 

• extending the preconditioner to more general discretizations and time-harmonic 
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Fig. 3.3. Three planes from an 8 Hz solution with the Overthrust model at its native resolution, 
801 X 801 X 187, with a single localized shot at the center of the x\X2 plane at a depth of 456 m: 
(top) a X2X3 plane near x\ = 14 km, (middle) an X1X3 plane near X2 = 14 km, and (bottom) an 
x\X2 plane near £3 = 0.9 km. 
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wave equations, 

• finding a fast preconditioner for problems with large cavities (perhaps through 
more general local auxiliary problems), 

• testing the performance improvements resulting from simultaneously factor- 
ing the subdomain problems and then redistributing the frontal trees, as well 
as a trsm approach to solving many right-hand sides, and 

• carefully studying the spectrum of the preconditioned operator for various 
classes of velocity models. 

Availability. The distributed dense linear algebra library, Elemental, is available 
under the New BSD License at http://code.google.eom/p/elemental. The dis- 
tributed multifrontal solver, Clique, is available under the GPLv3 at http : //github . 



com/poulson/Clique . The Parallel Sweeping Preconditioner (PSP) code is available 



under the GPLv3 at http://github.com/poulson/PSP. 
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